GPU-Accelerated Embedding Visualization with RAPIDS¶
This notebook loads pre-extracted embeddings and visualizes them using RAPIDS cuML for GPU-accelerated UMAP/t-SNE dimensionality reduction and Plotly for interactive 2D/3D visualizations.
📊 Overview¶
- Data Source: Pre-extracted embeddings from
embeddings_output/(parquet files) - GPU Acceleration: RAPIDS cuDF + cuML for 100x faster processing
- Algorithms: cuML UMAP and t-SNE (GPU-accelerated)
- Visualization: Interactive 2D/3D Plotly scatter plots
- Labels: Flexible labeling from dataset metadata, cuBERT, or None
🎯 Features¶
- ✅ GPU-Accelerated: cuML UMAP/t-SNE runs entirely on GPU
- ✅ Memory Efficient: cuDF for GPU DataFrame operations
- ✅ Scalable: Handle millions of embeddings efficiently
- ✅ Interactive: Plotly 2D/3D visualizations with hover details
- ✅ Flexible Labels: Support for metadata labels, cuBERT clustering, or unlabeled
- ✅ Export: Save to HTML for easy sharing
🔗 Reference¶
Based on RAPIDS cuBERT Topic Modelling
🔧 Setup: Install RAPIDS and Dependencies¶
In [1]:
# Install RAPIDS and required packages
# Note: RAPIDS requires specific CUDA versions. See: https://rapids.ai/start.html
# For CUDA 12.x:
# !pip install cudf-cu13 cuml-cu13 --extra-index-url=https://pypi.nvidia.com
# Core dependencies (non-RAPIDS fallback available)
%pip install numpy pandas plotly tqdm pyarrow -q
# Check if RAPIDS is available
try:
import cudf
import cuml
RAPIDS_AVAILABLE = True
print("✅ RAPIDS (cuDF, cuML) is available - GPU acceleration enabled!")
except ImportError:
RAPIDS_AVAILABLE = False
print("⚠️ RAPIDS not available - falling back to CPU (numpy/sklearn)")
print(" To install RAPIDS: pip install cudf-cu13 cuml-cu13 --extra-index-url=https://pypi.nvidia.com")
Note: you may need to restart the kernel to use updated packages. ✅ RAPIDS (cuDF, cuML) is available - GPU acceleration enabled!
In [2]:
import os
import numpy as np
import pandas as pd
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
from pathlib import Path
from tqdm.auto import tqdm
from typing import List, Dict, Tuple, Optional
import warnings
warnings.filterwarnings('ignore')
# RAPIDS imports (with CPU fallback)
if RAPIDS_AVAILABLE:
import cudf
import cupy as cp
from cuml.manifold import UMAP as cumlUMAP
from cuml.manifold import TSNE as cumlTSNE
print("✅ RAPIDS imports successful (cuDF, cuML UMAP/TSNE)")
else:
# CPU fallback
try:
from sklearn.manifold import TSNE as sklearnTSNE
import umap as cpuUMAP
print("✅ CPU fallback imports successful (sklearn TSNE, umap-learn)")
except ImportError:
print("⚠️ Installing CPU fallback packages...")
import subprocess
subprocess.run(["pip", "install", "umap-learn", "scikit-learn", "-q"])
from sklearn.manifold import TSNE as sklearnTSNE
import umap as cpuUMAP
# Configure Plotly for notebook rendering
import plotly.io as pio
pio.renderers.default = "notebook_connected"
print(f"\n🖥️ Using: {'GPU (RAPIDS cuML)' if RAPIDS_AVAILABLE else 'CPU (sklearn/umap-learn)'}")
print("✅ All imports successful!")
✅ RAPIDS imports successful (cuDF, cuML UMAP/TSNE) 🖥️ Using: GPU (RAPIDS cuML) ✅ All imports successful!
⚙️ Configuration¶
Configure paths, dataset selection, and visualization parameters.
📂 Path Options¶
| Option | When to Use | How to Configure |
|---|---|---|
| 1. Direct Paths | Simple, one-time setup | Edit paths directly in the cell below |
| 2. Relative Paths | Portable, stays with notebook | Uncomment the relative path block |
| 3. Environment Variables | CI/CD, shared environments | Set EMBEDDINGS_DIR, DATASETS_DIR, OUTPUT_DIR in shell |
📊 Dataset/Split Selection (similar to extract_embeddings_parallel_shards.py)¶
| Option | Example | Description |
|---|---|---|
| A. Load All | LOAD_ALL = True |
Load all available embeddings |
| B. Specific | ["v1:chat", "v2:math"] |
Select exact dataset:split combinations |
| C. By Dataset | ["v1:*", "v2:*"] |
All splits from specified datasets |
| D. By Split | ["*:chat", "*:code"] |
Same split across all datasets |
| E. Mixed | ["v1:*", "*:safety"] |
Combine patterns |
Quick Edit:¶
# Paths
EMBEDDINGS_DIR = Path("/your/path/to/embeddings")
DATASETS_DIR = Path("/your/path/to/datasets")
OUTPUT_DIR = Path("/your/path/to/outputs")
# Selection
LOAD_ALL = False
SELECTED_SPLITS = ["v1:chat", "v1:code", "llama-sft:*"]
In [3]:
# =============================================================================
# CONFIGURATION - EDIT PATHS HERE
# =============================================================================
# ┌─────────────────────────────────────────────────────────────────────────────┐
# │ OPTION 1: Direct Paths (Recommended) │
# │ Simply set absolute paths to your data directories │
# └─────────────────────────────────────────────────────────────────────────────┘
EMBEDDINGS_DIR = Path("/raid/embeddings")
DATASETS_DIR = Path("/raid/datasets")
OUTPUT_DIR = Path("/raid/outputs")
# ┌─────────────────────────────────────────────────────────────────────────────┐
# │ OPTION 2: Relative Paths (uncomment to use) │
# │ Paths relative to notebook location │
# └─────────────────────────────────────────────────────────────────────────────┘
# SCRIPT_DIR = Path(".").absolute()
# EMBEDDINGS_DIR = SCRIPT_DIR / "embeddings"
# DATASETS_DIR = SCRIPT_DIR / "datasets"
# OUTPUT_DIR = SCRIPT_DIR / "outputs"
# ┌─────────────────────────────────────────────────────────────────────────────┐
# │ OPTION 3: Environment Variables (uncomment to use) │
# │ Set via: export EMBEDDINGS_DIR=/path/to/embeddings │
# └─────────────────────────────────────────────────────────────────────────────┘
# EMBEDDINGS_DIR = Path(os.environ.get("EMBEDDINGS_DIR", "./embeddings"))
# DATASETS_DIR = Path(os.environ.get("DATASETS_DIR", "./datasets"))
# OUTPUT_DIR = Path(os.environ.get("OUTPUT_DIR", "./outputs"))
# =============================================================================
# DATASET / SPLIT SELECTION (similar to extract_embeddings_parallel_shards.py)
# =============================================================================
# Format: "dataset:split" - specify exactly which data to visualize
#
# OPTION A: Load ALL available embeddings
# LOAD_ALL = True
# SELECTED_SPLITS = [] # Ignored when LOAD_ALL = True
# OPTION B: Select specific dataset:split combinations (set LOAD_ALL = False)
# LOAD_ALL = False
# SELECTED_SPLITS = [
# "v1:chat",
# "v1:code",
# "v1:math",
# "v2:stem",
# "llama-sft:safety",
# "llama-sft:science",
# ]
# OPTION C: Select by dataset only (all splits from those datasets)
# LOAD_ALL = False
# SELECTED_SPLITS = ["v1:*", "v2:*"] # All splits from v1 and v2
# OPTION D: Select by split only (same split across all datasets)
# LOAD_ALL = False
# SELECTED_SPLITS = ["*:chat", "*:code"] # All chat and code splits
# OPTION E: Mix and match
# LOAD_ALL = False
# SELECTED_SPLITS = [
# "v1:*", # All v1 splits
# "v2:chat", # Only v2 chat
# "*:safety", # Safety from all datasets
# "llama-sft:code", # Specific combination
# ]
LOAD_ALL = False
SELECTED_SPLITS = [
# "llama-sft:chat",
# "llama-sft:code",
# "llama-sft:math",
# "llama-sft:science",
# "llama-sft:safety",
# "llama-sft:stem",
# "llama-sft:tool_calling",
"v2:chat",
"v2:code",
"v2:math",
"v2:stem",
] # Ignored when LOAD_ALL = True
# =============================================================================
# Validate paths exist
# =============================================================================
print("📂 Path Configuration:")
print(f" EMBEDDINGS_DIR: {EMBEDDINGS_DIR}")
print(f" DATASETS_DIR: {DATASETS_DIR}")
print(f" OUTPUT_DIR: {OUTPUT_DIR}")
# Check if paths exist
if not EMBEDDINGS_DIR.exists():
print(f" ⚠️ Warning: EMBEDDINGS_DIR does not exist!")
if not DATASETS_DIR.exists():
print(f" ⚠️ Warning: DATASETS_DIR does not exist!")
# Create output directory
OUTPUT_DIR.mkdir(parents=True, exist_ok=True)
print(f" ✅ OUTPUT_DIR created/verified")
# Show selection mode
print(f"\n📊 Data Selection:")
if LOAD_ALL:
print(f" Mode: Load ALL available embeddings")
else:
print(f" Mode: Selected splits only")
for s in SELECTED_SPLITS:
print(f" • {s}")
# =============================================================================
# VISUALIZATION SETTINGS
# =============================================================================
SAMPLE_SIZE = None # None = use all data, or set a number like 100000 for testing
RANDOM_SEED = 42
# =============================================================================
# VISUALIZATION SAMPLING (for dense plots)
# =============================================================================
# Use this to reduce points shown in plots without affecting UMAP computation
# Options:
# VIS_SAMPLE_FRACTION = 1.0 → Show all points (default)
# VIS_SAMPLE_FRACTION = 0.5 → Show 50% of points
# VIS_SAMPLE_FRACTION = 0.1 → Show 10% of points
# VIS_MAX_POINTS = 50000 → Cap at 50k points (overrides fraction if smaller)
# VIS_MAX_POINTS = None → No cap, use fraction only
VIS_SAMPLE_FRACTION = 0.2 # Fraction of points to visualize (0.0 to 1.0)
VIS_MAX_POINTS = None # Maximum points to show (None = no limit)
# Dimensionality reduction settings
REDUCTION_METHOD = "umap" # "umap" or "tsne"
N_COMPONENTS_2D = 2
N_COMPONENTS_3D = 3
# UMAP parameters (GPU-optimized)
UMAP_N_NEIGHBORS = 15
UMAP_MIN_DIST = 0.1
UMAP_METRIC = "cosine"
# t-SNE parameters (GPU-optimized)
TSNE_PERPLEXITY = 30
TSNE_LEARNING_RATE = 200
# Color scheme for categories
CATEGORY_COLORS = {
'chat': '#FF6B6B',
'code': '#4ECDC4',
'math': '#45B7D1',
'stem': '#96CEB4',
'tool_calling': '#FFEAA7',
'science': '#DDA0DD',
'safety': '#FF7F50',
# 'multilingual_ja': '#9B59B6',
# 'multilingual_de': '#3498DB',
# 'multilingual_it': '#E74C3C',
# 'multilingual_es': '#F39C12',
# 'multilingual_fr': '#1ABC9C',
'unknown': '#95A5A6'
}
# Custom colors for datasets
dataset_colors = {
'v1': '#FF6B6B',
'v2': '#4ECDC4',
'llama-sft': '#45B7D1',
'llama-rl': '#96CEB4',
'v3-science': '#DDA0DD',
'v3-math-proofs': '#F39C12',
'v3-instruction-chat': '#9B59B6'
}
print("\n⚙️ Visualization Settings:")
print(f" Reduction method: {REDUCTION_METHOD}")
print(f" Sample size (UMAP): {'All data' if SAMPLE_SIZE is None else f'{SAMPLE_SIZE:,}'}")
print(f" Vis sampling: fraction={VIS_SAMPLE_FRACTION}, max_points={VIS_MAX_POINTS or 'unlimited'}")
print(f" UMAP: n_neighbors={UMAP_N_NEIGHBORS}, min_dist={UMAP_MIN_DIST}, metric={UMAP_METRIC}")
print(f" t-SNE: perplexity={TSNE_PERPLEXITY}, learning_rate={TSNE_LEARNING_RATE}")
📂 Path Configuration:
EMBEDDINGS_DIR: /raid/embeddings
DATASETS_DIR: /raid/datasets
OUTPUT_DIR: /raid/outputs
✅ OUTPUT_DIR created/verified
📊 Data Selection:
Mode: Selected splits only
• v2:chat
• v2:code
• v2:math
• v2:stem
⚙️ Visualization Settings:
Reduction method: umap
Sample size (UMAP): All data
Vis sampling: fraction=0.2, max_points=unlimited
UMAP: n_neighbors=15, min_dist=0.1, metric=cosine
t-SNE: perplexity=30, learning_rate=200
📥 Load Extracted Embeddings¶
Load pre-extracted embeddings from parquet files in embeddings_output/ directory.
These were generated by extract_embeddings_parallel_shards.py.
In [4]:
def matches_selection(dataset: str, split: str, selected_splits: List[str]) -> bool:
"""
Check if a dataset:split combination matches any of the selection patterns.
Supports wildcards:
- "v1:chat" - exact match
- "v1:*" - all splits from v1
- "*:chat" - chat split from all datasets
- "*:*" - everything (same as LOAD_ALL=True)
Args:
dataset: Dataset name (e.g., "v1", "llama-sft")
split: Split name (e.g., "chat", "code")
selected_splits: List of selection patterns
Returns:
True if matches any pattern
"""
for pattern in selected_splits:
if ':' not in pattern:
# Treat as dataset-only pattern
if pattern == dataset or pattern == '*':
return True
continue
pat_dataset, pat_split = pattern.split(':', 1)
# Check dataset match
dataset_match = (pat_dataset == '*' or pat_dataset == dataset)
# Check split match
split_match = (pat_split == '*' or pat_split == split)
if dataset_match and split_match:
return True
return False
def discover_embedding_files(
embeddings_dir: Path,
load_all: bool = True,
selected_splits: Optional[List[str]] = None
) -> List[Dict]:
"""
Discover parquet embedding files with optional filtering.
Args:
embeddings_dir: Path to embeddings directory
load_all: If True, load all available embeddings
selected_splits: List of "dataset:split" patterns to filter
Supports wildcards: "v1:*", "*:chat", "v1:chat"
Returns:
List of dicts with: dataset, split, shard_idx, total_shards, filepath
"""
files = []
if not embeddings_dir.exists():
print(f"⚠️ Embeddings directory not found: {embeddings_dir}")
print(" Run extract_embeddings_parallel_shards.py first to generate embeddings.")
return files
if selected_splits is None:
selected_splits = []
# Track what we find vs what was requested
found_combinations = set()
# Walk through embeddings directory
for dataset_dir in sorted(embeddings_dir.iterdir()):
if not dataset_dir.is_dir():
continue
dataset_name = dataset_dir.name
for split_dir in sorted(dataset_dir.iterdir()):
if not split_dir.is_dir():
continue
split_name = split_dir.name
# Check if this combination should be included
if not load_all and selected_splits:
if not matches_selection(dataset_name, split_name, selected_splits):
continue
found_combinations.add(f"{dataset_name}:{split_name}")
# Find all parquet files
parquet_files = sorted(split_dir.glob("*.parquet"))
for pq_file in parquet_files:
# Parse filename: v1-chat-00000-of-00001.parquet
parts = pq_file.stem.split("-")
if len(parts) >= 4 and "of" in parts:
of_idx = parts.index("of")
shard_idx = int(parts[of_idx - 1])
total_shards = int(parts[of_idx + 1])
else:
shard_idx = 0
total_shards = 1
files.append({
'dataset': dataset_name,
'split': split_name,
'shard_idx': shard_idx,
'total_shards': total_shards,
'filepath': pq_file
})
# Show what was found
if not load_all and selected_splits:
print(f"\n🎯 Selection filter active:")
for pattern in selected_splits:
print(f" • {pattern}")
print(f"\n Matched {len(found_combinations)} dataset:split combination(s)")
return files
def load_embeddings_from_parquet(
file_infos: List[Dict],
sample_size: Optional[int] = None,
random_seed: int = 42
) -> Tuple[np.ndarray, pd.DataFrame, Dict[str, np.ndarray]]:
"""
Load embeddings from parquet files into numpy array and metadata DataFrame.
Also loads cached UMAP coordinates if available.
Args:
file_infos: List of file info dicts from discover_embedding_files()
sample_size: Optional limit on total samples to load
random_seed: Random seed for sampling
Returns:
Tuple of (embeddings_array, metadata_df, cached_umap_dict)
cached_umap_dict contains 'umap_2d' and 'umap_3d' arrays if found, else empty
"""
all_embeddings = []
all_metadata = []
all_umap_2d = []
all_umap_3d = []
has_umap_2d = True
has_umap_3d = True
np.random.seed(random_seed)
print(f"📂 Loading embeddings from {len(file_infos)} parquet file(s)...")
for file_info in tqdm(file_infos, desc="Loading files"):
filepath = file_info['filepath']
try:
# Load parquet file (always use pandas for consistency)
df_pd = pd.read_parquet(str(filepath))
embeddings = df_pd['embeddings'].tolist()
indices = df_pd['original_index'].tolist() if 'original_index' in df_pd.columns else list(range(len(df_pd)))
# Check for cached UMAP 2D coordinates
if has_umap_2d and 'umap_2d_x' in df_pd.columns and 'umap_2d_y' in df_pd.columns:
umap_2d_coords = df_pd[['umap_2d_x', 'umap_2d_y']].values
all_umap_2d.extend(umap_2d_coords.tolist())
else:
has_umap_2d = False
# Check for cached UMAP 3D coordinates
if has_umap_3d and 'umap_3d_x' in df_pd.columns and 'umap_3d_y' in df_pd.columns and 'umap_3d_z' in df_pd.columns:
umap_3d_coords = df_pd[['umap_3d_x', 'umap_3d_y', 'umap_3d_z']].values
all_umap_3d.extend(umap_3d_coords.tolist())
else:
has_umap_3d = False
# Add embeddings and metadata
for i, (emb, idx) in enumerate(zip(embeddings, indices)):
all_embeddings.append(emb)
all_metadata.append({
'dataset': file_info['dataset'],
'split': file_info['split'],
'shard_idx': file_info['shard_idx'],
'original_index': idx,
'label': file_info['split'], # Default label = split name
'filepath': str(filepath) # Track source file for saving back
})
except Exception as e:
print(f" ⚠️ Error loading {filepath.name}: {e}")
import traceback
traceback.print_exc()
continue
if not all_embeddings:
raise ValueError("No embeddings loaded! Check the embeddings directory.")
# Convert to numpy array
embeddings_array = np.array(all_embeddings, dtype=np.float32)
metadata_df = pd.DataFrame(all_metadata)
# Build cached UMAP dict
cached_umap = {}
if has_umap_2d and len(all_umap_2d) == len(embeddings_array):
cached_umap['umap_2d'] = np.array(all_umap_2d, dtype=np.float32)
print(f" ✅ Loaded cached UMAP 2D coordinates")
if has_umap_3d and len(all_umap_3d) == len(embeddings_array):
cached_umap['umap_3d'] = np.array(all_umap_3d, dtype=np.float32)
print(f" ✅ Loaded cached UMAP 3D coordinates")
# Sample if requested
if sample_size is not None and sample_size < len(embeddings_array):
print(f" 📊 Sampling {sample_size:,} from {len(embeddings_array):,} embeddings...")
sample_indices = np.random.choice(len(embeddings_array), size=sample_size, replace=False)
embeddings_array = embeddings_array[sample_indices]
metadata_df = metadata_df.iloc[sample_indices].reset_index(drop=True)
# Also sample cached UMAP if present
if 'umap_2d' in cached_umap:
cached_umap['umap_2d'] = cached_umap['umap_2d'][sample_indices]
if 'umap_3d' in cached_umap:
cached_umap['umap_3d'] = cached_umap['umap_3d'][sample_indices]
print(f"\n✅ Loaded {len(embeddings_array):,} embeddings")
print(f" Embedding dimension: {embeddings_array.shape[1]}")
print(f" Datasets: {metadata_df['dataset'].unique().tolist()}")
print(f" Splits: {metadata_df['split'].unique().tolist()}")
if cached_umap:
print(f" Cached UMAP: {list(cached_umap.keys())}")
else:
print(f" Cached UMAP: None (will compute)")
return embeddings_array, metadata_df, cached_umap
def save_umap_to_parquets(
metadata_df: pd.DataFrame,
embeddings_2d: Optional[np.ndarray] = None,
embeddings_3d: Optional[np.ndarray] = None
) -> None:
"""
Save computed UMAP coordinates back to the original parquet files.
This function updates each parquet file with new UMAP coordinate columns,
preserving the original data and adding/updating only the UMAP columns.
Args:
metadata_df: DataFrame with 'filepath' column indicating source files
embeddings_2d: Optional 2D UMAP coordinates (n_samples, 2)
embeddings_3d: Optional 3D UMAP coordinates (n_samples, 3)
"""
if embeddings_2d is None and embeddings_3d is None:
print("⚠️ No UMAP coordinates to save")
return
# Group metadata by filepath to process each file
grouped = metadata_df.groupby('filepath')
print(f"💾 Saving UMAP coordinates to {len(grouped)} parquet file(s)...")
for filepath, group_df in tqdm(grouped, desc="Saving to parquets"):
filepath = Path(filepath)
try:
# Get indices in the global array for this file
global_indices = group_df.index.tolist()
# Load existing parquet
df_existing = pd.read_parquet(str(filepath))
# Add UMAP 2D columns if provided
if embeddings_2d is not None:
umap_2d_for_file = embeddings_2d[global_indices]
df_existing['umap_2d_x'] = umap_2d_for_file[:, 0]
df_existing['umap_2d_y'] = umap_2d_for_file[:, 1]
# Add UMAP 3D columns if provided
if embeddings_3d is not None:
umap_3d_for_file = embeddings_3d[global_indices]
df_existing['umap_3d_x'] = umap_3d_for_file[:, 0]
df_existing['umap_3d_y'] = umap_3d_for_file[:, 1]
df_existing['umap_3d_z'] = umap_3d_for_file[:, 2]
# Save back to parquet
df_existing.to_parquet(str(filepath), index=False)
except Exception as e:
print(f" ⚠️ Error saving to {filepath.name}: {e}")
import traceback
traceback.print_exc()
continue
print(f"✅ UMAP coordinates saved to parquet files")
if embeddings_2d is not None:
print(f" • umap_2d_x, umap_2d_y")
if embeddings_3d is not None:
print(f" • umap_3d_x, umap_3d_y, umap_3d_z")
In [5]:
# Discover and load embeddings
print("=" * 80)
print("📂 Discovering embedding files...")
print("=" * 80)
# Use selection parameters from configuration
file_infos = discover_embedding_files(
EMBEDDINGS_DIR,
load_all=LOAD_ALL,
selected_splits=SELECTED_SPLITS
)
if file_infos:
# Show discovered files
print(f"\n📋 Found {len(file_infos)} embedding file(s):\n")
# Group by dataset and split
from collections import defaultdict
grouped = defaultdict(list)
for f in file_infos:
grouped[f"{f['dataset']}/{f['split']}"].append(f)
for key, files in sorted(grouped.items()):
total_shards = files[0]['total_shards']
print(f" {key}: {len(files)} shard(s) of {total_shards}")
# Load embeddings
print("\n" + "=" * 80)
embeddings, metadata_df, cached_umap = load_embeddings_from_parquet(file_infos, sample_size=SAMPLE_SIZE)
print("=" * 80)
# Display metadata distribution
print("\n📊 Data Distribution:")
print(f"\nBy Dataset:")
print(metadata_df['dataset'].value_counts().to_string())
print(f"\nBy Split (Label):")
print(metadata_df['split'].value_counts().to_string())
else:
print("\n⚠️ No embedding files found!")
print(" Expected directory structure:")
print(" embeddings/")
print(" ├── v1/")
print(" │ ├── chat/")
print(" │ │ └── v1-chat-00000-of-00001.parquet")
print(" │ └── ...")
print(" └── ...")
print("\n Run: python extract_embeddings_parallel_shards.py --all")
embeddings = None
metadata_df = None
cached_umap = {}
================================================================================ 📂 Discovering embedding files... ================================================================================ 🎯 Selection filter active: • v2:chat • v2:code • v2:math • v2:stem Matched 4 dataset:split combination(s) 📋 Found 18 embedding file(s): v2/chat: 12 shard(s) of 12 v2/code: 2 shard(s) of 2 v2/math: 2 shard(s) of 2 v2/stem: 2 shard(s) of 2 ================================================================================ 📂 Loading embeddings from 18 parquet file(s)...
Loading files: 0%| | 0/18 [00:00<?, ?it/s]
✅ Loaded cached UMAP 2D coordinates ✅ Loaded cached UMAP 3D coordinates ✅ Loaded 1,397,187 embeddings Embedding dimension: 4096 Datasets: ['v2'] Splits: ['chat', 'code', 'math', 'stem'] Cached UMAP: ['umap_2d', 'umap_3d'] ================================================================================ 📊 Data Distribution: By Dataset: dataset v2 1397187 By Split (Label): split chat 627720 stem 355000 math 239467 code 175000
🚀 GPU-Accelerated Dimensionality Reduction¶
Apply cuML UMAP or cuML t-SNE for GPU-accelerated dimensionality reduction. This is 10-100x faster than CPU-based methods for large datasets.
In [6]:
def apply_umap_gpu(embeddings: np.ndarray, n_components: int = 2) -> np.ndarray:
"""
Apply GPU-accelerated UMAP using cuML.
Args:
embeddings: Input embeddings (n_samples, n_features)
n_components: Output dimensions (2 or 3)
Returns:
Reduced embeddings (n_samples, n_components)
"""
print(f"🚀 Applying cuML UMAP (GPU-accelerated)...")
print(f" Input shape: {embeddings.shape}")
print(f" Output dimensions: {n_components}")
print(f" Parameters: n_neighbors={UMAP_N_NEIGHBORS}, min_dist={UMAP_MIN_DIST}, metric={UMAP_METRIC}")
# Convert to cupy array for GPU processing
embeddings_gpu = cp.asarray(embeddings, dtype=cp.float32)
# Initialize cuML UMAP
reducer = cumlUMAP(
n_components=n_components,
n_neighbors=UMAP_N_NEIGHBORS,
min_dist=UMAP_MIN_DIST,
metric=UMAP_METRIC,
random_state=RANDOM_SEED,
verbose=True
)
# Fit and transform
reduced = reducer.fit_transform(embeddings_gpu)
# Convert back to numpy
result = cp.asnumpy(reduced)
print(f"✅ UMAP complete! Output shape: {result.shape}")
return result
def apply_tsne_gpu(embeddings: np.ndarray, n_components: int = 2) -> np.ndarray:
"""
Apply GPU-accelerated t-SNE using cuML.
Args:
embeddings: Input embeddings (n_samples, n_features)
n_components: Output dimensions (2 or 3)
Returns:
Reduced embeddings (n_samples, n_components)
"""
print(f"🚀 Applying cuML t-SNE (GPU-accelerated)...")
print(f" Input shape: {embeddings.shape}")
print(f" Output dimensions: {n_components}")
print(f" Parameters: perplexity={TSNE_PERPLEXITY}, learning_rate={TSNE_LEARNING_RATE}")
# Convert to cupy array for GPU processing
embeddings_gpu = cp.asarray(embeddings, dtype=cp.float32)
# Initialize cuML t-SNE
reducer = cumlTSNE(
n_components=n_components,
perplexity=TSNE_PERPLEXITY,
learning_rate=TSNE_LEARNING_RATE,
random_state=RANDOM_SEED,
verbose=True
)
# Fit and transform
reduced = reducer.fit_transform(embeddings_gpu)
# Convert back to numpy
result = cp.asnumpy(reduced)
print(f"✅ t-SNE complete! Output shape: {result.shape}")
return result
def apply_umap_cpu(embeddings: np.ndarray, n_components: int = 2) -> np.ndarray:
"""CPU fallback for UMAP using umap-learn."""
print(f"🐢 Applying CPU UMAP (umap-learn)...")
print(f" Input shape: {embeddings.shape}")
reducer = cpuUMAP.UMAP(
n_components=n_components,
n_neighbors=UMAP_N_NEIGHBORS,
min_dist=UMAP_MIN_DIST,
metric=UMAP_METRIC,
random_state=RANDOM_SEED,
verbose=True
)
result = reducer.fit_transform(embeddings)
print(f"✅ UMAP complete! Output shape: {result.shape}")
return result
def apply_tsne_cpu(embeddings: np.ndarray, n_components: int = 2) -> np.ndarray:
"""CPU fallback for t-SNE using sklearn."""
print(f"🐢 Applying CPU t-SNE (sklearn)...")
print(f" Input shape: {embeddings.shape}")
reducer = sklearnTSNE(
n_components=n_components,
perplexity=TSNE_PERPLEXITY,
learning_rate=TSNE_LEARNING_RATE,
random_state=RANDOM_SEED,
verbose=1
)
result = reducer.fit_transform(embeddings)
print(f"✅ t-SNE complete! Output shape: {result.shape}")
return result
def reduce_dimensions(
embeddings: np.ndarray,
method: str = "umap",
n_components: int = 2
) -> np.ndarray:
"""
Apply dimensionality reduction using GPU if available, else CPU.
Args:
embeddings: Input embeddings
method: "umap" or "tsne"
n_components: 2 or 3
Returns:
Reduced embeddings
"""
if RAPIDS_AVAILABLE:
if method == "umap":
return apply_umap_gpu(embeddings, n_components)
else:
return apply_tsne_gpu(embeddings, n_components)
else:
if method == "umap":
return apply_umap_cpu(embeddings, n_components)
else:
return apply_tsne_cpu(embeddings, n_components)
In [7]:
# Apply dimensionality reduction (2D and 3D) with caching
if embeddings is not None:
print("=" * 80)
print(f"🗺️ {REDUCTION_METHOD.upper()} Dimensionality Reduction (with caching)")
print("=" * 80)
computed_2d = False
computed_3d = False
# =========================================================================
# 2D Reduction - Check cache first
# =========================================================================
if REDUCTION_METHOD == "umap" and 'umap_2d' in cached_umap and SAMPLE_SIZE is None:
print(f"\n📊 2D Projection: Using cached UMAP coordinates")
embeddings_2d = cached_umap['umap_2d']
else:
print(f"\n📊 2D Projection: Computing {REDUCTION_METHOD.upper()}...")
embeddings_2d = reduce_dimensions(embeddings, method=REDUCTION_METHOD, n_components=2)
computed_2d = True
# Add to metadata
metadata_df['x2d'] = embeddings_2d[:, 0]
metadata_df['y2d'] = embeddings_2d[:, 1]
# =========================================================================
# 3D Reduction - Check cache first
# =========================================================================
if REDUCTION_METHOD == "umap" and 'umap_3d' in cached_umap and SAMPLE_SIZE is None:
print("\n📊 3D Projection: Using cached UMAP coordinates")
embeddings_3d = cached_umap['umap_3d']
else:
print(f"\n📊 3D Projection: Computing {REDUCTION_METHOD.upper()}...")
embeddings_3d = reduce_dimensions(embeddings, method=REDUCTION_METHOD, n_components=3)
computed_3d = True
# Add to metadata
metadata_df['x3d'] = embeddings_3d[:, 0]
metadata_df['y3d'] = embeddings_3d[:, 1]
metadata_df['z3d'] = embeddings_3d[:, 2]
# =========================================================================
# Save newly computed UMAP coordinates to parquet files
# =========================================================================
if REDUCTION_METHOD == "umap" and (computed_2d or computed_3d) and SAMPLE_SIZE is None:
print("\n" + "-" * 40)
save_umap_to_parquets(
metadata_df,
embeddings_2d=embeddings_2d if computed_2d else None,
embeddings_3d=embeddings_3d if computed_3d else None
)
print("-" * 40)
print("\n" + "=" * 80)
print("✅ Dimensionality reduction complete!")
print(f" 2D: {'computed' if computed_2d else 'cached'}, 3D: {'computed' if computed_3d else 'cached'}")
print(f" 2D range: x=[{metadata_df['x2d'].min():.2f}, {metadata_df['x2d'].max():.2f}], y=[{metadata_df['y2d'].min():.2f}, {metadata_df['y2d'].max():.2f}]")
print(f" 3D range: x=[{metadata_df['x3d'].min():.2f}, {metadata_df['x3d'].max():.2f}], y=[{metadata_df['y3d'].min():.2f}, {metadata_df['y3d'].max():.2f}], z=[{metadata_df['z3d'].min():.2f}, {metadata_df['z3d'].max():.2f}]")
print("=" * 80)
else:
print("⚠️ No embeddings loaded - skipping dimensionality reduction")
================================================================================ 🗺️ UMAP Dimensionality Reduction (with caching) ================================================================================ 📊 2D Projection: Using cached UMAP coordinates 📊 3D Projection: Using cached UMAP coordinates ================================================================================ ✅ Dimensionality reduction complete! 2D: cached, 3D: cached 2D range: x=[-224.28, 124.49], y=[-184.24, 92.46] 3D range: x=[-217.08, 126.65], y=[-81.15, 247.46], z=[-175.15, 137.99] ================================================================================
📊 Interactive Plotly Visualizations¶
Create 2D and 3D interactive visualizations with Plotly. Labels come from dataset metadata (split names). Future versions can use cuBERT clustering.
In [8]:
def sample_for_visualization(
df: pd.DataFrame,
fraction: float = 1.0,
max_points: Optional[int] = None,
random_seed: int = 42
) -> pd.DataFrame:
"""
Sample data for visualization to reduce density.
Args:
df: Input DataFrame with visualization data
fraction: Fraction of points to keep (0.0 to 1.0)
max_points: Maximum number of points (overrides fraction if smaller)
random_seed: Random seed for reproducible sampling
Returns:
Sampled DataFrame
"""
n_source = len(df)
# Calculate target size based on fraction
n_target = int(n_source * fraction)
# Apply max_points cap if specified
if max_points is not None:
n_target = min(n_target, max_points)
# Don't sample if we're keeping everything
if n_target >= n_source:
return df
# Sample the data
np.random.seed(random_seed)
sampled_df = df.sample(n=n_target, random_state=random_seed).reset_index(drop=True)
print(f" 🎲 Visualization sampling: {n_source:,} → {n_target:,} points ({n_target/n_source*100:.1f}%)")
return sampled_df
def calculate_adaptive_opacity(n_points: int, min_opacity: float = 0.6, max_opacity: float = 0.8) -> float:
"""
Calculate adaptive opacity based on number of points.
More points = lower opacity to better show density.
Args:
n_points: Number of data points
min_opacity: Minimum opacity (for very large datasets)
max_opacity: Maximum opacity (for small datasets)
Returns:
Opacity value between min_opacity and max_opacity
"""
# Thresholds for opacity scaling
low_threshold = 1000 # Below this, use max opacity
high_threshold = 500000 # Above this, use min opacity
if n_points <= low_threshold:
return max_opacity
elif n_points >= high_threshold:
return min_opacity
else:
# Logarithmic scaling for smooth transition
log_range = np.log10(high_threshold) - np.log10(low_threshold)
log_pos = np.log10(n_points) - np.log10(low_threshold)
ratio = log_pos / log_range
return max_opacity - (max_opacity - min_opacity) * ratio
def calculate_adaptive_marker_size(n_points: int, min_size: float = 6, max_size: float = 8) -> float:
"""
Calculate adaptive marker size based on number of points.
More points = smaller markers.
Args:
n_points: Number of data points
min_size: Minimum marker size (for very large datasets)
max_size: Maximum marker size (for small datasets)
Returns:
Marker size value
"""
low_threshold = 1000
high_threshold = 5000
if n_points <= low_threshold:
return max_size
elif n_points >= high_threshold:
return min_size
else:
log_range = np.log10(high_threshold) - np.log10(low_threshold)
log_pos = np.log10(n_points) - np.log10(low_threshold)
ratio = log_pos / log_range
return max_size - (max_size - min_size) * ratio
def create_2d_scatter(
df: pd.DataFrame,
color_col: str = 'split',
title: str = "2D Embedding Visualization",
color_map: Optional[Dict] = None
) -> go.Figure:
"""
Create interactive 2D scatter plot with Plotly.
Automatically adjusts opacity and marker size based on data density.
Args:
df: DataFrame with x, y columns and metadata
color_col: Column to use for coloring points
title: Plot title
color_map: Optional custom color mapping
Returns:
Plotly Figure object
"""
# Use custom colors if provided
if color_map is None:
color_map = CATEGORY_COLORS
# Get unique values and assign colors
unique_vals = df[color_col].unique()
colors = {val: color_map.get(val, '#95A5A6') for val in unique_vals}
# Calculate adaptive rendering parameters based on density
n_points = len(df)
opacity = calculate_adaptive_opacity(n_points)
marker_size = calculate_adaptive_marker_size(n_points)
print(f" 📊 Adaptive rendering: {n_points:,} points → opacity={opacity:.2f}, size={marker_size:.1f}")
fig = px.scatter(
df,
x='x2d',
y='y2d',
color=color_col,
color_discrete_map=colors,
hover_data=['dataset', 'split', 'label'],
title=title,
labels={'x': f'{REDUCTION_METHOD.upper()} Dimension 1', 'y': f'{REDUCTION_METHOD.upper()} Dimension 2'},
template='plotly_white'
)
fig.update_traces(
marker=dict(size=marker_size, opacity=opacity, line=dict(width=0.2, color='white'))
)
fig.update_layout(
width=1000,
height=900,
title_font_size=20,
title_x=0.5,
legend=dict(
title=color_col.title(),
yanchor="top",
y=0.99,
xanchor="left",
x=1.01,
bgcolor="rgba(255, 255, 255, 0.9)",
bordercolor="gray",
borderwidth=1
),
hovermode='closest'
)
return fig
def create_3d_scatter(
df: pd.DataFrame,
color_col: str = 'split',
title: str = "3D Embedding Visualization",
color_map: Optional[Dict] = None
) -> go.Figure:
"""
Create interactive 3D scatter plot with Plotly.
Automatically adjusts opacity and marker size based on data density.
Args:
df: DataFrame with x3d, y3d, z3d columns and metadata
color_col: Column to use for coloring points
title: Plot title
color_map: Optional custom color mapping
Returns:
Plotly Figure object
"""
# Use custom colors if provided
if color_map is None:
color_map = CATEGORY_COLORS
# Get unique values and assign colors
unique_vals = df[color_col].unique()
colors = {val: color_map.get(val, '#95A5A6') for val in unique_vals}
# Calculate adaptive rendering parameters based on density
# Use slightly smaller sizes for 3D to avoid clutter
n_points = len(df)
opacity = calculate_adaptive_opacity(n_points, min_opacity=0.6, max_opacity=0.8)
marker_size = calculate_adaptive_marker_size(n_points, min_size=5.0, max_size=8)
print(f" 📊 Adaptive rendering (3D): {n_points:,} points → opacity={opacity:.2f}, size={marker_size:.1f}")
fig = px.scatter_3d(
df,
x='x3d',
y='y3d',
z='z3d',
color=color_col,
color_discrete_map=colors,
hover_data=['dataset', 'split', 'label'],
title=title,
labels={
'x3d': f'{REDUCTION_METHOD.upper()} Dim 1',
'y3d': f'{REDUCTION_METHOD.upper()} Dim 2',
'z3d': f'{REDUCTION_METHOD.upper()} Dim 3'
},
template='plotly_white'
)
fig.update_traces(
marker=dict(size=marker_size, opacity=opacity, line=dict(width=0.1, color='white'))
)
fig.update_layout(
width=1000,
height=900,
title_font_size=20,
title_x=0.5,
scene=dict(
xaxis_title=f'{REDUCTION_METHOD.upper()} Dimension 1',
yaxis_title=f'{REDUCTION_METHOD.upper()} Dimension 2',
zaxis_title=f'{REDUCTION_METHOD.upper()} Dimension 3',
camera=dict(eye=dict(x=1.5, y=1.5, z=1.2))
),
legend=dict(
title=color_col.title(),
yanchor="top",
y=0.99,
xanchor="left",
x=0.01,
bgcolor="rgba(255, 255, 255, 0.9)",
bordercolor="gray",
borderwidth=1
)
)
return fig
In [9]:
# Create and display visualizations
if metadata_df is not None and 'x2d' in metadata_df.columns:
print("=" * 80)
print("📊 Creating Interactive Visualizations")
print("=" * 80)
# =========================================================================
# Apply visualization sampling (reduces density for plotting only)
# =========================================================================
vis_df = sample_for_visualization(
metadata_df,
fraction=VIS_SAMPLE_FRACTION,
max_points=VIS_MAX_POINTS,
random_seed=RANDOM_SEED
)
# =========================================================================
# Visualization 1: 2D scatter by Split (default label)
# =========================================================================
print("\n🎨 Creating 2D visualization colored by Split...")
fig_2d_split = create_2d_scatter(
vis_df,
color_col='split',
title=f'{REDUCTION_METHOD.upper()} 2D Projection - Colored by Split'
)
# Save to HTML
output_file = OUTPUT_DIR / f'{REDUCTION_METHOD}_2d_by_split.html'
fig_2d_split.write_html(str(output_file))
print(f" ✅ Saved: {output_file}")
# Display
fig_2d_split.show()
else:
print("⚠️ No data to visualize - run previous cells first")
================================================================================ 📊 Creating Interactive Visualizations ================================================================================ 🎲 Visualization sampling: 1,397,187 → 279,437 points (20.0%) 🎨 Creating 2D visualization colored by Split... 📊 Adaptive rendering: 279,437 points → opacity=0.62, size=6.0 ✅ Saved: /raid/outputs/umap_2d_by_split.html
In [10]:
# =========================================================================
# Visualization 2: 2D scatter by Dataset
# =========================================================================
if 'vis_df' in dir() and vis_df is not None and 'x2d' in vis_df.columns:
print("\n🎨 Creating 2D visualization colored by Dataset...")
fig_2d_dataset = create_2d_scatter(
vis_df,
color_col='dataset',
title=f'{REDUCTION_METHOD.upper()} 2D Projection - Colored by Dataset',
color_map=dataset_colors
)
# Save to HTML
output_file = OUTPUT_DIR / f'{REDUCTION_METHOD}_2d_by_dataset.html'
fig_2d_dataset.write_html(str(output_file))
print(f" ✅ Saved: {output_file}")
# Display
fig_2d_dataset.show()
else:
print("⚠️ No data to visualize - run previous cells first")
🎨 Creating 2D visualization colored by Dataset... 📊 Adaptive rendering: 279,437 points → opacity=0.62, size=6.0 ✅ Saved: /raid/outputs/umap_2d_by_dataset.html
In [11]:
# =========================================================================
# Visualization 3: 3D scatter by Split
# =========================================================================
if 'vis_df' in dir() and vis_df is not None and 'x3d' in vis_df.columns:
print("\n🎨 Creating 3D visualization colored by Split...")
fig_3d_split = create_3d_scatter(
vis_df,
color_col='split',
title=f'{REDUCTION_METHOD.upper()} 3D Projection - Colored by Split'
)
# Save to HTML
output_file = OUTPUT_DIR / f'{REDUCTION_METHOD}_3d_by_split.html'
fig_3d_split.write_html(str(output_file))
print(f" ✅ Saved: {output_file}")
# Display
fig_3d_split.show()
else:
print("⚠️ No 3D data to visualize - run previous cells first")
🎨 Creating 3D visualization colored by Split... 📊 Adaptive rendering (3D): 279,437 points → opacity=0.62, size=5.0 ✅ Saved: /raid/outputs/umap_3d_by_split.html